Prediction of Petrophysical Properties from Well Logs

Objective

The primary objective of this project is to develop a data-driven machine learning model to estimate reservoir properties, including shale volume, porosity, and fluid saturation, from a common suite of well logs: gamma ray, bulk density, neutron porosity, resistivity, and sonic.

Data Preparation

Exploratory Data Analysis

This is done to identify missing values, outliers, trends, and anomalies within the data. The goal is to understand the data thoroughly before proceeding with the analysis.

There are no categorical data types in the dataset; all columns are float or int type (i.e., numerical data). Let us explore further.

A sentinel value of -999 is present in most of the columns.

Let us visualize the raw dataset.

Log Plot

  1. The plot shows extremely large and extremely small values, suggesting the presence of outliers.
  2. Apart from the gamma ray log, no deflections are observed from the surface down to about 8000 ft for the remaining logs. A likely explanation is that logging tools are typically deployed across pay zones, which sit a few thousand feet below surface; we therefore assume that actual logging started at depths beyond 8000 ft.

Let us further explore with boxplots

The simple box plots show the following:

  1. A large number of negative values in all the log measurements, which does not augur well given domain knowledge of typical log values.
  2. A significant number of outliers in the RDEP, RMED, GR, DTC and DTS logs.

Before cleaning the data and treating outliers, let us examine the distribution of the various columns in the given data

Univariate Analysis

Let us discuss the inferences from the univariate analysis.

  1. Well number and well depth are uniformly and normally distributed respectively, which is expected since well number is nominal. We also expect more data at the depths of interest, which here are around 5000 to 10000 ft.
  2. DTC, DTS, CALI, DEN, DENC, BS, GR, NEU, PEF, RDEP, RMED, ROP, PHIF, SW and VSH all show bimodality, meaning two different distributions are present in the data.
  3. DTC, DTS, DEN, DENC, NEU, PEF and PHIF have a large proportion of their distributions centred at -999. This is unusual from a domain-knowledge standpoint: compressional travel time, shear travel time, density and neutron porosity logs do not normally take such large negative values in oil and gas operations. Assumption: these values come from depths where the logging tools were not deployed and which are not of interest to this study.

  4. RDEP and RMED have averages of -948 and -804 respectively, which is unusual for conventional oil formations like the Volve field. Resistivity logs are usually presented on a logarithmic scale from 0.2 to 2000 ohm-m (https://glossary.oilfield.slb.com/en/terms/r/resistivity_log).

Basic Reservoir Engineering

  1. The resistivity of reservoir formation fluids is essential in determining water saturation, which in turn is used to estimate the hydrocarbon initially in place. This is usually done with Archie's equation, shown below, which also relates water saturation to porosity.

$$S_w = \left(\frac{a\,R_w}{\phi^m\,R_t}\right)^{1/n}$$

where $S_w$ is water saturation, $R_w$ the formation water resistivity, $R_t$ the true formation resistivity, $\phi$ porosity, and $a$, $m$, $n$ the tortuosity factor, cementation exponent and saturation exponent.
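As a quick illustration, Archie's equation can be coded directly. The default $a$, $m$ and $n$ below are common textbook values, not parameters fitted to this field:

```python
def archie_sw(rw, rt, phi, a=1.0, m=2.0, n=2.0):
    """Water saturation via Archie: Sw = ((a * Rw) / (phi**m * Rt)) ** (1/n)."""
    return ((a * rw) / (phi ** m * rt)) ** (1.0 / n)

# Example: Rw = 0.05 ohm-m, Rt = 10 ohm-m, porosity = 0.25
print(round(archie_sw(0.05, 10.0, 0.25), 3))  # → 0.283
```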

  1. Sonic logs and bulk density logs are also used in determining formation porosity, as shown in the equations below.

Porosity estimation from formation bulk density

$$\phi_D = \frac{\rho_{ma} - \rho_b}{\rho_{ma} - \rho_{fl}}$$

Porosity estimation from the neutron log

Neutron porosity is read directly from the neutron tool, which is calibrated in porosity units for an assumed matrix.

Porosity Estimation from travel time

$$\phi_S = \frac{\Delta t - \Delta t_{ma}}{\Delta t_{fl} - \Delta t_{ma}}$$

where $\Delta t$ is the measured travel time and $\Delta t_{ma}$, $\Delta t_{fl}$ are the matrix and fluid travel times (Wyllie time average).
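These porosity relations translate directly into code. The matrix and fluid constants below are typical sandstone/fresh-water values and are assumptions for illustration, not values taken from this dataset:

```python
def density_porosity(rho_b, rho_ma=2.65, rho_fl=1.0):
    """Porosity from bulk density (rho_ma: matrix, rho_fl: fluid, in g/cc)."""
    return (rho_ma - rho_b) / (rho_ma - rho_fl)

def sonic_porosity(dt, dt_ma=55.5, dt_fl=189.0):
    """Wyllie time-average porosity from compressional travel time (us/ft)."""
    return (dt - dt_ma) / (dt_fl - dt_ma)

print(round(density_porosity(2.35), 3))  # → 0.182
print(round(sonic_porosity(89.0), 3))    # → 0.251
```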

Before we do any form of data cleaning, let us determine whether there are correlations between the individual parameters in the dataset.

Checking correlations between variables

Caveat: these correlations may be due to anomalies in the data, i.e., the -999 values recorded.
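A correlation check can be sketched as follows. The DataFrame and its column names are illustrative stand-ins for the real log table, and the sentinel is masked first so it cannot dominate the coefficients:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for a subset of the log columns.
logs = pd.DataFrame(rng.random((200, 4)), columns=["GR", "DEN", "NEU", "PHIF"])

# Mask the -999 sentinel before correlating, otherwise it dominates.
corr = logs.replace(-999, np.nan).corr()
print(corr.shape)  # → (4, 4)
```

The resulting matrix can be passed to `seaborn.heatmap` for the usual annotated plot.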

Positive Correlation

  1. PHIF---DEPTH, CALI, DEN, DENC, NEU, PEF, VSH, SW
  2. SW---CALI, DEN, DENC, NEU, PEF, VSH
  3. VSH---CALI, DTC, DEN, DENC, NEU, PEF, PHIF, SW

Negative Correlation

  1. PHIF---BS,ROP
  2. SW---BS, ROP

Due to the high level of suspected anomalies in the data, we would not look at collinearity at this stage.

Now let us clean up the data
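One way to begin the cleanup, assuming the -999 sentinel noted earlier, is to map it to NaN so pandas treats it as missing. The column names here are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw well-log table.
rawdata = pd.DataFrame({"GR": [55.2, -999.0, 80.1],
                        "DEN": [2.31, 2.45, -999.0]})

# Replace the -999 sentinel with NaN so it is counted as missing.
rawdata1 = rawdata.replace(-999, np.nan)
print(rawdata1.isna().sum().sum())  # → 2
```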

The statistics of the data are looking better.

Let us look at the missing data

Let us visualize the missing data.

We have a significant number of missing values

Let us see their percentage
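The percentage of missing values per column can be computed in one line; the toy DataFrame below is only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"DTS": [1.0, np.nan, np.nan, 4.0],
                   "GR":  [50.0, 60.0, np.nan, 70.0]})

# Fraction of NaN per column, expressed as a percentage.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)  # DTS 50.0, GR 25.0
```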

Comments

From the above, we can see that DTS, VSH, PHIF, SW, DTC, DENC, PEF, NEU, DEN and CALI have a large number of missing values. Hence, we need to apply domain knowledge and consider dropping irrelevant columns, or columns with little or no relationship with the target variables.

First, let us drop all missing values in the target variables.

Let us create a new dataset

Let us check missing data in the new data set

  1. The number of missing values has reduced significantly.
  2. There are, however, significant missing values in the compressional and shear travel times, which are used in sonic porosity determination.
  3. To confirm this, let us check whether there is some form of correlation between these logs and the target variables.

BIVARIATE ANALYSIS

Taking some notes from the bivariate analysis

  1. DTC has a positive correlation with PHIF and minimal correlation with VSH and SW.
  2. DTS has a slight positive correlation with VSH.

Exploring the data further

DTC and DTS clearly correlate with the target variables.

Let us look at rawdata1 again.

Let us check the percentage of missing values

  1. The number of missing values in the DTS and DTC columns is greater than half of the overall data.
  2. Removing the missing data would make us lose most of our data.

Let us explore techniques for dealing with the missing values.

The following techniques would be employed

a. Imputation Technique

b. Using algorithms that accommodate missing values

Data Imputation

Petroleum well logs and petrophysical data are measured on different scales and are highly variable. This is due to the significant difference in the mineralogical constituents of the rock formations encountered during logging. Using conventional imputation techniques, such as imputing the mean or another measure of central tendency, would significantly affect the integrity of the data and the model accuracy.

We employ the method of Akinnikawe et al. (2018), where simple linear regression is used to generate synthetic log values and impute them in place of the missing values.

Now, we will predict the missing values for DTS, DTC, PEF, DENC and ROP.

Methodology

The method employed here is as follows:

  1. First, copy rawdata1 into a new dataset, rawdata2.
  2. Select rows with no missing values for training the model.
  3. Select all rows with missing values in the target column but no missing values in the remaining columns for testing.
  4. Check for collinearity and select appropriate features.
  5. Scale the data, fit the model on the training data, and generate synthetic logs from the test data.
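The steps above can be sketched as a small helper. The column names and the toy linear relationship are hypothetical; they only demonstrate the mechanics:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def impute_by_regression(df, target):
    """Fill missing `target` values with predictions from a linear model
    trained on the fully observed rows."""
    features = [c for c in df.columns if c != target]
    train = df.dropna()                                     # rows with no missing values
    test = df[df[target].isna() & df[features].notna().all(axis=1)]
    if test.empty:
        return df.copy()
    scaler = StandardScaler().fit(train[features])
    model = LinearRegression().fit(scaler.transform(train[features]), train[target])
    out = df.copy()
    out.loc[test.index, target] = model.predict(scaler.transform(test[features]))
    return out

# Toy check with DTS = 2*DTC - 10 (a hypothetical relationship).
df = pd.DataFrame({"DTC": [60.0, 70.0, 80.0, 90.0],
                   "DTS": [110.0, 130.0, 150.0, np.nan]})
filled = impute_by_regression(df, "DTS")
print(round(filled.loc[3, "DTS"], 1))  # → 170.0
```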

DTS Prediction

Check for collinearity

Almost all the features are highly correlated. At this stage, however, we define a VIF threshold value of 100.
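A VIF check can be computed directly from least squares, without assuming any particular library; the data below are synthetic:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R_j^2), where R_j^2
    comes from regressing column j on the remaining columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - ((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(factors)

rng = np.random.default_rng(0)
X = rng.random((200, 3))
X = np.column_stack([X, X[:, 0] + 1e-3 * rng.random(200)])  # near-duplicate column
v = vif(X)
print(v[0] > 100, v[3] > 100)  # → True True: the collinear pair has huge VIFs
```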

Scaling training data

  1. We have successfully replaced all the NaN values in rawdata1.
  2. Let us repeat the process for DTC.

Predicting the values for DTC

Let us repeat the process for the corrected density and photoelectric factor.

DENC

Repeat for PEF

Prediction of Imputation Values for ROP

Before we proceed, let us go through our data one more time.

  1. rawdata: the original data, consisting of all data points as obtained from the Volve field.
  2. rawdata1: the original data after replacing -999 values with NaN based on domain knowledge and removing missing values.
  3. rawdata2: rawdata1 with the large blocks of missing values imputed by linear regression predictions.

We would create a new dataset, called rawdata3, containing all parameters that affect the target variables based on domain knowledge.

Let us visualize rawdata2

  1. We create a new dataset called rawdata3.
  2. In this dataset, BS and WELLNUM are left out because they do not correlate strongly with the target variables.

Now let us drop all the missing data in rawdata3 and store the result in a new dataset called rawdata_new.

rawdata_new is our new dataset which we would use for training of the model

Let us analyze and visualize our new dataset.

  1. RDEP is almost uniformly distributed.
  2. RMED and GR are right-skewed.
  3. NEU, DTS and DTC are fairly normally distributed.
  4. The density log (DEN) is bimodal, the corrected density log (DENC) is fairly normally distributed, and the photoelectric factor (PEF) is bimodal as well.
  5. DTS is fairly normal; CALI and VSH are right-skewed.
  6. All four parameters have more than one mode.

Boxplots

  1. Plotting all the boxplots together does not reveal the true nature of the individual distributions due to large variations in the measurements.
  2. Let us plot them individually.

VIOLIN PLOTS

Pairplots

  1. The distributions plotted above show a significant number of outliers. Let us explore them.

Outlier Detection and Removal

Let us try DB-SCAN

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

DI = rawdata_new.values  # numpy array of the underlying data, without column names or indices

# Create DBSCAN model with parameters
DB_model = DBSCAN(eps=0.5, min_samples=2)  # defaults: metric='euclidean', algorithm='auto', leaf_size=30
pred_DB = DB_model.fit_predict(DI)

# Extract outliers: DBSCAN labels anomalies as -1
outlier_index1 = np.where(pred_DB == -1)
outlier_values1 = DI[outlier_index1]

# Feature scaling
sc = StandardScaler()
DI_scaled1 = sc.fit_transform(DI)
outlier_values_scaled1 = sc.transform(outlier_values1)

# Apply PCA to reduce the dimensionality for plotting
pca = PCA(n_components=2)
DI_pca1 = pca.fit_transform(DI_scaled1)
outlier_values_pca1 = pca.transform(outlier_values_scaled1)

# Plot the data
sns.scatterplot(x=DI_pca1[:, 0], y=DI_pca1[:, 1])
sns.scatterplot(x=outlier_values_pca1[:, 0], y=outlier_values_pca1[:, 1], color='r')
plt.title("DBSCAN Outlier Detection", fontsize=15, pad=15)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend(['Inliers', 'Suspected Outliers'])

DBSCAN is highly sensitive to its eps and min_samples settings and did not separate the outliers convincingly here.

The isolation forest method has shown that indeed we have a significant number of outliers.

  1. We have a significant number of outliers in the following logs: NEU, GR, RMED, CALI, VSH, PEF, DENC, DTS and DTC, as seen in the plots.
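Since the isolation forest method is referenced above, a minimal sketch of how it flags outliers (on synthetic data, with an assumed contamination level) might look like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
X[:5] += 8.0  # inject a few obvious outliers

iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)          # -1 = outlier, 1 = inlier
print((labels == -1).sum())      # roughly contamination * n points flagged
```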

Outlier Removal Techniques

  1. We are going to apply two outlier removal techniques and compare their effects on model prediction:

a. The Z-score method
b. The percentile method

Z-score

  1. Let us define a new dataset cleaned based on the z-score of all parameters in the data.

We have cleaned the dataset rawdata_new to contain only rows with a z-score below 2.5 in all columns.
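A minimal sketch of the z-score filter, on a synthetic column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"GR": rng.normal(75.0, 10.0, 1000)})
df.loc[0, "GR"] = 500.0  # an obvious outlier

# Keep only rows with |z| < 2.5 in every column.
z = np.abs((df - df.mean()) / df.std())
df_z = df[(z < 2.5).all(axis=1)]
print(len(df_z) < len(df), 0 in df_z.index)  # → True False
```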

Let us look at the shapes of rawdata_new and rawd_zscore

Let us see what the pairplots shows

Outliers have been significantly reduced.

Percentile Method
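The percentile method can be sketched as keeping rows between the 1st and 99th percentiles of each column; the cutoffs and the synthetic column are illustrative choices, not necessarily the ones used in this project:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"RDEP": rng.lognormal(1.0, 1.0, 1000)})

# Keep rows between the 1st and 99th percentiles.
lo, hi = df["RDEP"].quantile([0.01, 0.99])
df_pct = df[df["RDEP"].between(lo, hi)]
print(len(df_pct))
```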

Test Data Preparation

The data has 11274 rows and 14 columns.

Cleaning Test Data

  1. Replace all the -999 values with NaN, just as in the training dataset.

Missing values

  1. Missing values can be found in DTS, ROP and DTC, with small numbers of missing values in PEF, RDEP, RMED, DEN and DENC.
  2. We would remove the small numbers of missing values and predict the major blocks of missing values.
  3. We would repeat the same steps used for the training dataset.

Predicting Missing values for DTS

All the DTS values here are NaN and are what is to be predicted.

Using a VIF threshold of 100, we take CALI and DEN out of the predictors.

Predicting DTC

Using a VIF threshold of 100, we take CALI and DEN out of the predictors.

Predicting ROP

Using a VIF threshold of 100, we take CALI and DEN out of the predictors.

MODELLING

METHODOLOGY

We would address the modelling in three ways using different training sets and identify the effects of different variables.

  1. MULTI-TARGET REGRESSION: We would predict all three target variables at the same time.

Data_Types:

a. We would use rawdata_new (no missing values, but outliers not yet addressed), rawd_zscore (outliers addressed using the Z-score method), and rawd_log (outliers addressed with a log transformation).

b. We would use data with missing values and an algorithm that accommodates missing values.

In both cases, different machine learning algorithms would be explored.

BOOSTING RESULTS

  1. We would explore Feature Engineering (PCA where necessary) and other techniques to improve or boost the best performing algorithms in a. and b. above

EXTRAS

1- Other approaches would be explored depending on the results achieved.

Regression with rawd_zscore

Clustering

  1. Before we start modelling, we would cluster the logs into electrofacies using a clustering algorithm. Electrofacies are numerical combinations of petrophysical logs that reflect rock intervals with specific characteristics. This would help us relate the log measurements to specific rock intervals.

  2. After model development, the predicted values would be appended to the test dataset and the clustering repeated. Identifying distinct electrofacies is one way to check the accuracy of the predictions.

The goal is that after clustering, we want to see some sort of clear distinction in the data points in the pairplots

The elbow plot suggests 2 or 6 clusters. We would validate this with silhouette analysis.
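The elbow and silhouette computations can be sketched as follows, on synthetic blob data rather than the log dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                 # plotted against k for the elbow method
    sil[k] = silhouette_score(X, km.labels_)  # higher is better

best_k = max(sil, key=sil.get)
print(best_k)
```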

We are seeing some distinction.
Let us further visualize with other plots.
Boxplots
Log Plots
Comments
  1. Although the elbow plot and silhouette analysis suggest that the optimal number of clusters is 5, the clusters are not distinct.
  2. The clusters have different sizes in the silhouette analysis; cluster labels 4 and 5 have very few data points.
  3. The boxplots are also not distinct.
  4. The boxplots also show that RDEP and RMED strongly affect the clustering.
  5. We would explore taking one of them out, as both are resistivity logs. We may also take both out depending on the results obtained.
Let us take RMED out.

Hierarchical Clustering

Summary of Clustering Results

  1. Both KMeans and hierarchical clustering were experimented with.

KMeans Clustering

  1. A new dataset called "Clustering_Data" was created from rawdata_new consisting of selected well logs. i.e. 'GR', 'NEU', 'PEF', 'RDEP', 'RMED', 'DENC'
  2. The data was scaled with a standard scaler.
  3. For the selected dataset, the elbow plot and silhouette analysis were used in determining the optimal number of clusters.
  4. Five (5) clusters were chosen; however, these clusters were not distinct, as seen in the electrofacies plots and boxplots. The boxplots overlapped significantly.
  5. KMeans was improved by selecting only the well logs significant for electrofacies determination, i.e. GR, NEU and PEF. The algorithm had the following parameters (State them).
  6. More distinct clusters were obtained.

Hierarchical Clustering

  1. HC could not be carried out on the entire dataset due to time and space complexity; HC is computationally expensive.
  2. A stratified sample of 28694 data points was created from the original dataset.
  3. HC did not cluster the dataset well, as seen in the dendrogram.
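A minimal sketch of hierarchical (Ward) clustering on a small sample, using synthetic data in place of the stratified sample:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 3))  # stand-in for the stratified sample

# Ward-linkage agglomerative clustering, cut at 5 clusters.
hc = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = hc.fit_predict(sample)
print(len(set(labels)))  # → 5
```

A dendrogram would be produced separately, e.g. with `scipy.cluster.hierarchy.linkage` and `dendrogram`.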

REGRESSION

Preparing Training data

Scaling the data

LINEAR REGRESSION

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

where $n$ is the number of samples and $p$ the number of predictors.
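Adjusted R-squared penalizes R-squared for the number of predictors and can be computed directly:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    for n samples and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.90, 101, 10), 4))  # → 0.8889
```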

K-Nearest Neighbours

Support Vector Machine

Train for Volume of Shale
Train for Volume of Porosity

Train for water saturation

Support Vector Machine with PCA

Train for VSH

Train for Porosity

Train for Water Saturation (SW)

Random Forest

Feature engineering in random forest

Gradient Boosting

Regression Plot Summary

Validation Results

Summary

  1. Overfitting was seen in KNN.
  2. The GradientBoost regression performed optimally.
  3. The support vector machine did not perform very well.
  4. GradientBoost regression is selected as the best model.

Preparing Test Data for Prediction

KNN Model
GBR Model
KNN Model

REFERENCES

  1. Akinnikawe, O., Lyne, S., & Roberts, J. (2018, July). Synthetic well log generation using machine learning techniques. In SPE/AAPG/SEG Unconventional Resources Technology Conference. OnePetro.